Intermediate Pandas: Joins, slices, and subsetting

A short workshop run by the Library Bioinformatics Service

tinyurl.com/wcmpandas02

Based on the Data Carpentry curriculum for Data Visualization in Python (© Data Carpentry, under a Creative Commons Attribution license)


Our Data

As for the last lesson, we will be using NCD Risk Factor Collaboration (NCD-RisC) data, from Worldwide trends in body-mass index, underweight, overweight, and obesity from 1975 to 2016: a pooled analysis of 2416 population-based measurement studies in 128.9 million children, adolescents, and adults. Lancet 2017, published online 11 October 2017.

We will be using the country-specific adult BMI data and the country-specific height-at-age-18 data.

The downloaded filename is NCD_RisC_Lancet_2017_BMI_age_standardised_country.csv, but we have saved a simplified version of the file here as NCD_RisC_bmi.csv.

The table of arable land (hectares per person) is taken from the World Bank Databank

The table of life expectancy is taken from Our World in Data.


In many "real world" situations, the data that we want to use come in multiple files. We often need to combine these files into a single DataFrame to analyze the data. The pandas package provides various methods for combining DataFrames including merge and concat.

To work through the examples below, we first need to load the BMI and height files into pandas DataFrames. In IPython:


In [1]:
# import pandas library:
import pandas as pd

# set the plots to appear 'inline' in the notebook:
%matplotlib inline

In [2]:
df_bmi = pd.read_csv("data/NCD_RisC_bmi.csv")
df_bmi.head()


Out[2]:
iso sex year bmi lower95ci upper95ci obese_pc obese_lower95ci obese_upper95ci
0 AFG Men 1975 18.999440 16.331119 21.763077 0.002078 0.000384 0.006515
1 COM Men 1987 21.212235 19.678790 22.755298 0.008317 0.002737 0.019086
2 COK Women 1975 28.988165 26.359859 31.612574 0.359538 0.255615 0.469956
3 DJI Women 1985 23.543410 20.425051 26.709342 0.099453 0.054829 0.156816
4 FRA Women 1995 24.406912 23.795591 25.022224 0.159790 0.122057 0.202200

In [3]:
df_height = pd.read_csv("data/NCD_RisC_height.csv")
df_height.head()


Out[3]:
Country ISO Sex Year of birth Mean height (cm) Mean height lower 95% uncertainty interval (cm) Mean height upper 95% uncertainty interval (cm)
0 Afghanistan AFG Men 1896 161.164095 154.484285 167.754033
1 Afghanistan AFG Men 1897 161.196286 154.571603 167.659618
2 Afghanistan AFG Men 1898 161.228297 154.707340 167.602576
3 Afghanistan AFG Men 1899 161.260727 154.835644 167.528113
4 Afghanistan AFG Men 1900 161.293068 154.959540 167.508077

Take note that the read_csv method we used can take some additional options. Many functions in Python have a set of options that can be set by the user if needed. More about all of the read_csv options can be found in the pandas documentation.
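For example, we could load only a subset of columns, tell pandas which strings to treat as missing values, or read just the first few rows. A minimal sketch (the values here are only illustrative, but usecols, na_values and nrows are all genuine read_csv options):


In [ ]:
# A few commonly used read_csv options
df_small = pd.read_csv("data/NCD_RisC_bmi.csv",
                       usecols=["iso", "sex", "year", "bmi"],  # load only these columns
                       na_values=["NA", ""],                   # strings to treat as missing
                       nrows=100)                              # read only the first 100 rows
df_small.head()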

Concatenating DataFrames

We can use the concat function in Pandas to append either columns or rows from one DataFrame to another. Let's grab two subsets of our data to see how this works.


In [4]:
# Read in first 10 lines of bmi table
bmi_sub_first10 = df_bmi.head(10)

# Grab the last 10 rows
bmi_sub_last10 = df_bmi.tail(10)

In [7]:
# Reset the index values so the second dataframe appends properly
bmi_sub_last10=bmi_sub_last10.reset_index(drop=True)
# drop=True option avoids adding new index column with old index values

In [11]:
bmi_sub_first10


Out[11]:
iso sex year bmi lower95ci upper95ci obese_pc obese_lower95ci obese_upper95ci
0 AFG Men 1975 18.999440 16.331119 21.763077 0.002078 0.000384 0.006515
1 COM Men 1987 21.212235 19.678790 22.755298 0.008317 0.002737 0.019086
2 COK Women 1975 28.988165 26.359859 31.612574 0.359538 0.255615 0.469956
3 DJI Women 1985 23.543410 20.425051 26.709342 0.099453 0.054829 0.156816
4 FRA Women 1995 24.406912 23.795591 25.022224 0.159790 0.122057 0.202200
5 JPN Men 1987 22.530648 22.338057 22.722702 0.010894 0.007791 0.014741
6 MKD Women 1996 25.423879 24.099956 26.773022 0.183430 0.128061 0.247452
7 SVK Women 1990 24.985383 24.174007 25.834639 0.150873 0.106447 0.200941
8 SWE Men 1982 24.106473 23.748258 24.462109 0.091508 0.068023 0.119301
9 AFG Men 1976 19.105518 16.500077 21.801109 0.002237 0.000437 0.006820

When we concatenate DataFrames, we need to specify the axis. axis=0 tells Pandas to stack the second DataFrame under the first one. It will automatically detect whether the column names are the same and will stack accordingly. axis=1 will stack the columns in the second DataFrame to the RIGHT of the first DataFrame. To stack the data vertically, we need to make sure we have the same columns and associated column format in both datasets. When we stack horizontally, we want to make sure what we are doing makes sense (i.e., the data are related in some way).


In [8]:
# Stack the DataFrames on top of each other
vertical_stack = pd.concat([bmi_sub_first10, bmi_sub_last10], axis=0)

# Place the DataFrames side by side
horizontal_stack = pd.concat([bmi_sub_first10, bmi_sub_last10], axis=1)

In [13]:
vertical_stack.reset_index(drop=True)


Out[13]:
iso sex year bmi lower95ci upper95ci obese_pc obese_lower95ci obese_upper95ci
0 AFG Men 1975 18.999440 16.331119 21.763077 0.002078 0.000384 0.006515
1 COM Men 1987 21.212235 19.678790 22.755298 0.008317 0.002737 0.019086
2 COK Women 1975 28.988165 26.359859 31.612574 0.359538 0.255615 0.469956
3 DJI Women 1985 23.543410 20.425051 26.709342 0.099453 0.054829 0.156816
4 FRA Women 1995 24.406912 23.795591 25.022224 0.159790 0.122057 0.202200
5 JPN Men 1987 22.530648 22.338057 22.722702 0.010894 0.007791 0.014741
6 MKD Women 1996 25.423879 24.099956 26.773022 0.183430 0.128061 0.247452
7 SVK Women 1990 24.985383 24.174007 25.834639 0.150873 0.106447 0.200941
8 SWE Men 1982 24.106473 23.748258 24.462109 0.091508 0.068023 0.119301
9 AFG Men 1976 19.105518 16.500077 21.801109 0.002237 0.000437 0.006820
10 ZWE Women 2007 25.114254 24.680130 25.557846 0.222229 0.173664 0.277528
11 ZWE Women 2008 25.148895 24.703729 25.602596 0.226768 0.177657 0.282221
12 ZWE Women 2009 25.182664 24.718421 25.646877 0.231246 0.181843 0.287981
13 ZWE Women 2010 25.216258 24.730675 25.696082 0.235707 0.185437 0.292872
14 ZWE Women 2011 25.250409 24.738483 25.759248 0.240192 0.188258 0.299295
15 ZWE Women 2012 25.284840 24.743206 25.821593 0.244698 0.190822 0.305243
16 ZWE Women 2013 25.319948 24.741024 25.898562 0.249242 0.193386 0.311415
17 ZWE Women 2014 25.356222 24.722863 25.987488 0.253836 0.195761 0.318625
18 ZWE Women 2015 25.393409 24.719608 26.084431 0.258485 0.197569 0.325454
19 ZWE Women 2016 25.431487 24.701907 26.192666 0.263240 0.199351 0.333036

In [ ]:

Row Index Values and Concat

Have a look at the vertical_stack dataframe. Notice anything unusual? The row indexes for the two dataframes bmi_sub_first10 and bmi_sub_last10 have been repeated. We can reindex the new dataframe using the reset_index() method.


In [ ]:
# use this cell to view your vertical_stack dataframe

In [ ]:
# use this cell to view your horizontal_stack dataframe

In [ ]:
# use this cell to re-index the vertical_stack dataframe
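Alternatively, concat can renumber the rows for us in one step via its ignore_index option:


In [ ]:
# ignore_index=True renumbers the rows of the result from 0,
# avoiding the duplicated index values altogether
pd.concat([bmi_sub_first10, bmi_sub_last10], ignore_index=True)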

Writing Out Data to CSV

We can use the to_csv command to export a DataFrame in CSV format. Note that the code below will by default save the data into the current working directory. We can save it to a different folder by adding the folder name and a slash to the file name: vertical_stack.to_csv('foldername/out.csv'). We use index=False so that pandas doesn't include the index number for each line.

Before we write anything out, though, a quick aside: notice what concat does when the two DataFrames' columns only partially overlap. Pandas keeps all of the columns and fills the gaps with NaN.


In [22]:
df_1 = pd.DataFrame({"A":[1,2,3,4], "B":[3,4,5,6]})
df_2 = pd.DataFrame({"C":[1,2,3,4], "B":[3,4,5,6]})
df_1


Out[22]:
A B
0 1 3
1 2 4
2 3 5
3 4 6

In [23]:
df_2


Out[23]:
B C
0 3 1
1 4 2
2 5 3
3 6 4

In [24]:
pd.concat([df_1, df_2], axis=0)


Out[24]:
A B C
0 1.0 3 NaN
1 2.0 4 NaN
2 3.0 5 NaN
3 4.0 6 NaN
0 NaN 3 1.0
1 NaN 4 2.0
2 NaN 5 3.0
3 NaN 6 4.0

In [ ]:
# Write DataFrame to CSV
vertical_stack.to_csv('data/vertical_stack.csv', index=False)

Check out your working directory to make sure the CSV wrote out properly, and that you can open it! If you want, try to bring it back into Python to make sure it imports properly.

Make sure that any directory you are specifying has already been created.


In [ ]:
# For kicks, read our output back into Python and make sure all looks good
new_output = pd.read_csv('data/vertical_stack.csv')

Challenge - Combine Data

In the data folder, there are two height data files: NCD_RisC_height_1896.csv and NCD_RisC_height_1996.csv. Read the data into Python and combine the files to make one new DataFrame. Create a plot of mean height by year of birth, grouped by sex. Export your results as a CSV and make sure it reads back into Python properly.
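One possible approach is sketched below (it assumes the two files share the column names of the df_height table we loaded earlier):


In [ ]:
# Read the two files and stack them vertically
h_1896 = pd.read_csv("data/NCD_RisC_height_1896.csv")
h_1996 = pd.read_csv("data/NCD_RisC_height_1996.csv")
heights = pd.concat([h_1896, h_1996], ignore_index=True)

# Mean height by year of birth, grouped by sex
heights.groupby(["Year of birth", "Sex"])["Mean height (cm)"].mean().unstack().plot()

# Export, then read back in to check the round trip
heights.to_csv("data/heights_combined.csv", index=False)
pd.read_csv("data/heights_combined.csv").head()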


In [ ]:


In [ ]:


In [ ]:


In [ ]:

Joining DataFrames

When we concatenated our DataFrames we simply added them to each other - stacking them either vertically or side by side. Another way to combine DataFrames is to use columns in each dataset that contain common values (a common unique id). Combining DataFrames using a common field is called "joining". The columns containing the common values are called "join key(s)". Joining DataFrames in this way is often useful when one DataFrame is a "lookup table" containing additional data that we want to include in the other.

NOTE: This process of joining tables is similar to what we do with tables in an SQL database.

For example, we might create a lookup table for our country data. This table could contain information such as economic indicators, population, and area for each of the countries. The country code would be unique for each line. Rather than adding multiple extra columns to each of the 16,800 lines of the bmi data table, we can maintain the shorter table with the country information. When we want to access that information, we can create a query that joins the additional columns of information to the BMI data.

Storing data in this way has many benefits including:

  1. It ensures consistency in the spelling of country attributes (names, regions, and so on), given each country is only entered once. Imagine the possibilities for spelling errors when entering a country name thousands of times!
  2. It also makes it easy for us to change the country information in one place, without having to find each instance of it in the larger BMI data.
  3. It optimizes the size of our data.

Joining Two DataFrames

To better understand joins, let's grab the first 10 lines of our data as a subset to work with. We'll use the .head method to do this. We'll also read in a subset of the height table.


In [25]:
# Read in first 10 lines of bmi table
bmi_sub = df_bmi.head(10)

# Import a small subset of the arable land data designed for this part of the lesson.
# It is stored in the data folder.
df_arable = pd.read_csv('data/arable_land.csv')
arable_sub = df_arable.head(70)

In this example, df_arable is the lookup table containing country code, country, and hectares of arable land per person, that we want to join with the data in bmi_sub to produce a new DataFrame that contains all of the columns from both df_bmi and df_arable.

Identifying join keys

To identify appropriate join keys we first need to know which field(s) are shared between the files (DataFrames). We might inspect both DataFrames to identify these columns. If we are lucky, both DataFrames will have columns with the same name that also contain the same data. If we are less lucky, we need to identify a (differently-named) column in each DataFrame that contains the same information.


In [26]:
arable_sub.columns


Out[26]:
Index(['iso', 'country_name', 'arable_land_ha_pp'], dtype='object')

In [27]:
bmi_sub.columns


Out[27]:
Index(['iso', 'sex', 'year', 'bmi', 'lower95ci', 'upper95ci', 'obese_pc',
       'obese_lower95ci', 'obese_upper95ci'],
      dtype='object')

In our example, the join key is the column containing the three-letter country identifier, which is called iso.

Now that we know the fields with the common country ID attributes in each DataFrame, we are almost ready to join our data. However, since there are different types of joins, we also need to decide which type of join makes sense for our analysis.

Inner joins

The most common type of join is called an inner join. An inner join combines two DataFrames based on a join key and returns a new DataFrame that contains only those rows that have matching values in both of the original DataFrames.

Inner joins yield a DataFrame that contains only rows where the value being joined exists in BOTH tables.

The pandas function for performing joins is called merge, and an inner join is the default option:


In [32]:
merged_inner = pd.merge(left=bmi_sub,
                        right=arable_sub, 
                        left_on='iso', 
                        right_on='iso')
# In this case `iso` is the only column name in both dataframes, so if we skipped the `left_on`
# and `right_on` arguments we would still get the same result

# What's the size of the output data?
merged_inner.shape
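# NB: in a notebook only the last expression in a cell is displayed,
# so Out[32] below shows only the result of type(merged_inner)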
merged_inner
type(merged_inner)


Out[32]:
pandas.core.frame.DataFrame

In [33]:
merged_inner.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 11 columns):
iso                  5 non-null object
sex                  5 non-null object
year                 5 non-null int64
bmi                  5 non-null float64
lower95ci            5 non-null float64
upper95ci            5 non-null float64
obese_pc             5 non-null float64
obese_lower95ci      5 non-null float64
obese_upper95ci      5 non-null float64
country_name         5 non-null object
arable_land_ha_pp    5 non-null float64
dtypes: float64(7), int64(1), object(3)
memory usage: 480.0+ bytes

The result of an inner join of bmi_sub and arable_sub is a new DataFrame that contains the combined set of columns from bmi_sub and arable_sub. It only contains rows that have three-letter country codes that are the same in both the bmi_sub and arable_sub DataFrames. In other words, if a row in bmi_sub has a value of iso that does not appear in the iso column of arable_sub, it will not be included in the DataFrame returned by an inner join. Similarly, if a row in arable_sub has a value of iso that does not appear in the iso column of bmi_sub, that row will not be included in the DataFrame returned by an inner join.

The two DataFrames that we want to join are passed to the merge function using the left and right argument. The left_on='iso' argument tells merge to use the iso column as the join key from bmi_sub (the left DataFrame). Similarly, the right_on='iso' argument tells merge to use the iso column as the join key from arable_sub (the right DataFrame). For inner joins, the order of the left and right arguments does not matter.

The resulting merged_inner DataFrame contains all of the columns from bmi_sub (iso, year, bmi, etc.) as well as all the columns from arable_sub (iso, country, and arable land per person).

Notice that merged_inner has fewer rows than bmi_sub. This is an indication that there were rows in bmi_sub with value(s) for iso that do not exist as value(s) for iso in arable_sub.

Left joins

What if we want to add information from arable_sub to bmi_sub without losing any of the information from bmi_sub? In this case, we use a different type of join called a "left outer join", or a "left join".

Like an inner join, a left join uses join keys to combine two DataFrames. Unlike an inner join, a left join will return all of the rows from the left DataFrame, even those rows whose join key(s) do not have values in the right DataFrame. Rows in the left DataFrame that are missing values for the join key(s) in the right DataFrame will simply have null (i.e., NaN or None) values for those columns in the resulting joined DataFrame.

Note: a left join will still discard rows from the right DataFrame that do not have values for the join key(s) in the left DataFrame.

A left join is performed in pandas by calling the same merge function used for inner join, but using the how='left' argument:


In [ ]:
merged_left = pd.merge(left=bmi_sub,
                       right=arable_sub, 
                       how='left', 
                       left_on='iso', 
                       right_on='iso')

merged_left

The result DataFrame from a left join (merged_left) looks very much like the result DataFrame from an inner join (merged_inner) in terms of the columns it contains. However, unlike merged_inner, merged_left contains the same number of rows as the original bmi_sub DataFrame. When we inspect merged_left, we find there are rows where the information that should have come from arable_sub (i.e., country_name, and arable_land_ha_pp) is missing (they contain NaN values):


In [ ]:
merged_left[ pd.isnull(merged_left.country_name) ]

These rows are the ones where the value of iso from bmi_sub (in this case, SVK, JPN, MKD, etc.) does not occur in arable_sub.

Other join types

The pandas merge function supports two other join types:

  • Right (outer) join: Invoked by passing how='right' as an argument. Similar to a left join, except all rows from the right DataFrame are kept, while rows from the left DataFrame without matching join key(s) values are discarded.
  • Full (outer) join: Invoked by passing how='outer' as an argument. This join type returns all rows from both DataFrames; the result DataFrame will contain NaN where data is missing in one of the dataframes. This join type is very rarely used. A sketch of both follows.
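Neither requires new syntax beyond the how argument. A quick sketch using the same bmi_sub and arable_sub tables from above:


In [ ]:
# Right join: keep every row of arable_sub, even those with no bmi_sub match
merged_right = pd.merge(bmi_sub, arable_sub, how='right', on='iso')

# Full outer join: keep every row from both tables
merged_outer = pd.merge(bmi_sub, arable_sub, how='outer', on='iso')

merged_outer.shape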

Final Challenges

Challenge - Distributions

Create a new DataFrame by joining the contents of the NCD_RisC_bmi.csv and NCD_RisC_height.csv tables. Then calculate and plot the distribution of:

  1. bmi by country
  2. bmi by sex by country
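A possible starting point is sketched below. The join keys are assumed from the column listings above; note also that year in the BMI table is a survey year while Year of birth in the height table is a birth cohort, so this pairing is illustrative rather than meaningful:


In [ ]:
# Join on country code and sex (and, illustratively, year vs. year of birth)
merged = pd.merge(df_bmi, df_height,
                  left_on=['iso', 'sex', 'year'],
                  right_on=['ISO', 'Sex', 'Year of birth'])

# 1. Distribution of bmi by country (a boxplot is one option)
merged.boxplot(column='bmi', by='iso', figsize=(15, 4))

# 2. Distribution of bmi by sex by country
merged.boxplot(column='bmi', by=['iso', 'sex'], figsize=(15, 4))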

Challenge - Weight by geographical region

  1. In the data folder, there is a CSV that contains information about the countries assigned to different geographical regions. Use that data to summarize the number of countries by region.
  2. Calculate an expected mean weight of each region, using the bmi and height data.

In [ ]:
df_regions = pd.read_csv('data/world_regions.csv')
df_bmi = pd.read_csv('data/NCD_RisC_bmi.csv')
df_height = pd.read_csv('data/NCD_RisC_height.csv')
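One way to start is sketched below (the alpha-3 and region column names are assumed from the df_regions examples later in this lesson, and weight is estimated as BMI × height² with height in metres):


In [ ]:
# 1. Number of countries per region
df_regions.groupby('region')['alpha-3'].nunique()

# 2. Expected weight (kg) = bmi * height (m)^2, joining on country code and sex.
# NB: this pairs every survey year with every birth cohort, so you may want
# to restrict both tables to comparable years first.
bmi_height = pd.merge(df_bmi, df_height,
                      left_on=['iso', 'sex'], right_on=['ISO', 'Sex'])
bmi_height['expected_weight_kg'] = (
    bmi_height['bmi'] * (bmi_height['Mean height (cm)'] / 100) ** 2)

weights = pd.merge(bmi_height, df_regions, left_on='iso', right_on='alpha-3')
weights.groupby('region')['expected_weight_kg'].mean()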

In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:

Slicing Subsets of Rows and Columns in Python

We can select specific ranges of our data in both the row and column directions using either label or integer-based indexing.

  • loc is primarily label-based indexing. Integers may be used, but they are interpreted as labels.
  • iloc is primarily integer-based indexing

To select a subset of rows and columns from our DataFrame, we can use the iloc method. For example, we can select the country code, region, and sub-region (columns 2, 3 and 4 if we start counting at 1), like this:


In [ ]:
# iloc[row slicing, column slicing]
df_regions.iloc[0:3, 1:4]

which gives the output

    alpha-3 region  sub-region
0   AFG Asia    Southern Asia
1   ALA Europe  Northern Europe
2   ALB Europe  Southern Europe

Notice that we asked for a slice from 0:3. This yielded 3 rows of data. When you ask for 0:3, you are actually telling Python to start at index 0 and select rows 0, 1, 2 up to but not including 3.

Let's explore some other ways to index and select subsets of data:


In [ ]:
# Select all columns for rows of index values 0 and 10
df_regions.loc[[0, 10], :]

In [ ]:
# What does this do?
df_regions.loc[0, ['name', 'alpha-3']]

In [ ]:
# What happens when you type the code below?
df_regions.loc[[0, 10, 149], :]

NOTE: Labels must be found in the DataFrame or you will get a KeyError.

Indexing by labels (loc) differs from indexing by integers (iloc). With loc, both the start bound and the stop bound are inclusive. When using loc, integers can be used, but the integers refer to the index label and not the position. For example, using loc to select 1:4 will get a different result than using iloc to select rows 1:4.
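A quick sketch of the difference:


In [ ]:
# loc slices by label and includes both bounds: rows labelled 1, 2, 3 and 4
df_regions.loc[1:4]

# iloc slices by position and excludes the stop bound: rows in positions 1, 2 and 3
df_regions.iloc[1:4]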

We can also select a specific data value using a row and column location within the DataFrame and iloc indexing:

# Syntax for iloc indexing to find a specific data element
dat.iloc[row, column]

In [ ]:
df_regions.iloc[23, 3]

Remember that Python indexing begins at 0. So, the index location [23, 3] selects the element that is 24 rows down and 4 columns over in the DataFrame.

Challenge - Range

  1. What happens when you execute:

    • df_bmi[0:1]
    • df_bmi[:4]
    • df_bmi[:-1]
  2. What happens when you call:

    • df_bmi.iloc[0:4, 1:4]
    • df_bmi.loc[0:4, 1:4]
  • How are the two commands different?

Subsetting Data using Criteria

We can also select a subset of our data using criteria. For example, we can select all rows where the region is Oceania:


In [ ]:
df_regions[df_regions.region == 'Oceania']

Or we can select all rows that are not in Oceania:


In [ ]:
df_regions[df_regions.region != 'Oceania']

We can define sets of criteria too:


In [ ]:
df_regions[(df_regions.region == 'Asia') | (df_regions.region == 'Europe')]

Python Syntax Cheat Sheet

You can use the syntax below when querying data by criteria from a DataFrame. Experiment with selecting various subsets of the df_bmi data; a couple of examples follow this list.

  • Equals: ==
  • Not equals: !=
  • Greater than, less than: > or <
  • Greater than or equal to >=
  • Less than or equal to <=
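For example:


In [ ]:
# Rows from 2010 onwards
df_bmi[df_bmi.year >= 2010]

# Criteria can be combined with & (and) and | (or); wrap each one in parentheses
df_bmi[(df_bmi.sex == 'Women') & (df_bmi.bmi < 20)]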

Challenge - Queries

  1. Select a subset of rows in the df_bmi DataFrame that contain data from the year 1999 and that contain mean bmi values less than or equal to 18. How many rows did you end up with?

  2. You can use the isin command in Python to query a DataFrame based upon a list of values as follows:

    df_bmi[df_bmi['iso'].isin([listGoesHere])]
    

    Use the isin function to find all bmi records from a list of countries in the "df_bmi" DataFrame. How many records contain these values?

  3. Experiment with other queries. Create a query that finds all rows with a bmi between 20 and 23.

  4. The ~ symbol in Python can be used to return the OPPOSITE of the selection that you specify. It is equivalent to "is not in". Write a query that selects all rows with year NOT equal to 1996 or 2004 in the df_bmi data.
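Some possible patterns for these challenges are sketched below (the country codes passed to isin are arbitrary examples):


In [ ]:
# 1. Year 1999 with mean bmi <= 18
df_bmi[(df_bmi.year == 1999) & (df_bmi.bmi <= 18)]

# 2. Records for a list of countries
df_bmi[df_bmi['iso'].isin(['AFG', 'JPN', 'SWE'])]

# 3. bmi between 20 and 23
df_bmi[(df_bmi.bmi > 20) & (df_bmi.bmi < 23)]

# 4. Years NOT equal to 1996 or 2004
df_bmi[~df_bmi['year'].isin([1996, 2004])]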


In [ ]: